Research Findings: This study provides the first independent investigation of the second most widely
used multidimensional assessment in Head Start: the Preschool Child Observation Record, Second
Edition (COR-2). We conducted a comprehensive investigation into the validity of the COR-2 using
data from all children in an urban school district's Head Start program (N = 4,071). Confirmatory
factor analysis revealed a misfit between the 6 developer-defined categories and the data. Although
exploratory analyses revealed a possible 4-factor solution, subsequent analyses indicated problems
with this structure as well. Item response theory methods were used to determine whether there
was support for the 5-point response scale of each item representing an appropriately sequenced
set of skill points. Results indicated that nearly half of the COR-2 items had reversed or poorly spaced
thresholds, suggesting potential problems with these items' functioning. Practice or Policy: Specific
implications of the findings for the further development of the COR-2 in terms of its constructs and
items as well as general implications for early childhood assessment are discussed.

Results from the Early Childhood Longitudinal Study's Birth cohort indicate that children from
families living in poverty start kindergarten substantially behind more economically advantaged
children in reading and mathematics (Denton Flanagan, McPhee, & Mulligan, 2009). The
National Head Start program is the federal government's response to close these achievement
gaps for children from low-income households by ensuring that these children are ready to start
school. To meet its objectives, the Head Start program was developed based on, and continues
to be guided by, developmental science theory and research. However, it was not until the
Improving Head Start for School Readiness Act of 2007 was enacted that an explicit mandate
was made to use scientific evidence to inform all aspects of the program (Zigler & Styfco,
2010). Clearly emphasized in this act was a call for the use of scientifically based assessment
that must

(A) be developmentally, linguistically, and culturally appropriate for the population served; (B)
be reviewed periodically, based on advances in the science of early childhood development; (C)
be consistent with relevant, nationally recognized professional and technical standards related to
the assessment of young children; (D) be valid and reliable in the language in which they are
administered; (E) be administered by staff with appropriate training for such administration; (F)
provide for appropriate accommodations for children with disabilities and children who are limited

Correspondence regarding this article should be addressed to Katherine M. Barghaus, Graduate School of Education,
University of Pennsylvania, 3700 Walnut Street, Philadelphia, PA 19104. E-mail: barghaus@upenn.edu
English proficient; (G) be high-quality research-based measures that have been demonstrated to
assist with the purposes for which they were devised. (Improving Head Start for School Readiness
Act of 2007, Sec. 641A)

To help meet this call, the U.S. Congress charged the National Research Council (NRC) with
developing guidelines to evaluate the quality of available early childhood assessments (NRC,
2008). Informed by developmental science, the NRC committee defined quality in terms of
the capacity to measure important domains of children's functioning across time (i.e., cognitive,
language, physical, social-emotional, and approaches toward learning) and sensitivity to
unique child characteristics (e.g., sex, race, and language). The committee provided quality
criteria for scientifically based assessment by drawing upon the Standards for Educational
and Psychological Testing (American Educational Research Association [AERA], American
Psychological Association [APA], & National Council on Measurement in Education [NCME],
1999; herein referred to as the Standards). These quality criteria are organized into three
categories: validity, reliability, and unbiasedness. Although each of these quality categories is
important, the Standards notes that "validity is... the most fundamental consideration in
developing and evaluating tests" (AERA, APA, & NCME, 1999, p. 9). Validity is essential because it
refers to the extent to which there is evidence that test scores can be interpreted as intended
(AERA, APA, & NCME, 1999).

According to the Standards, psychometrically sound assessments have validity evidence
based on their content, response process, internal structure, relationships to other variables,
and consequences of their use (AERA, APA, & NCME, 1999). Content validity evidence is
derived from the systematic process used to develop and evaluate the targeted construct's
definition and corresponding items (Downing & Haladyna, 1997; Kane, 2006a). Validity evidence
based on the response process comes from documentation of the extent to which the assessment
process is free from error (Downing, 2003). Response process validity for observational
assessments, which are widely used with young children, is established through evidence that the
observational process does not introduce error (Downing, 2003). Validity evidence based on
the internal structure of an assessment refers to the extent to which the test and items conform
to the targeted constructs and the intended use (AERA, APA, & NCME, 1999). Assessments that
aim to capture multiple constructs should have evidence from factor analysis supporting their
dimensionality (Gorsuch, 2003). In addition, if a measure aims to capture development, it is
important to determine whether the items accurately reflect an ability continuum. Evidence based
on an assessment's relationships to other variables is assessed by correlations between scores on
the assessment and on other relevant measures (AERA, APA, & NCME, 1999). Finally, according
to the Standards, validity evidence can also be provided based upon the consequences of
using an assessment. This source of evidence is contentious because it indicates a broader
definition of validity that includes both the interpretation of test scores and the consequences
of their use (Cizek, 2012; Cook & Beckman, 2006; Kane, 2006b). Thus, validity experts have
argued that the consequences of use be considered in concert with, but separate from, other
aspects of validity evidence (e.g., Cizek, 2012).

To provide decision makers with information on the psychometric quality of assessments,
the Administration for Children and Families commissioned a compendium of widely used
measures (Halle, Zaslow, Wessel, Moodie, & Darling-Churchill, 2011). The compendium
included formative assessments that (a) covered three or more of the domains of the Head
Start Child Outcomes Framework, (b) had some evidence base, and (c) were accessible for
general use. Based on these criteria, only eight assessments were selected for review.¹ The
purpose of this research was to use multivariate statistics and item response theory (IRT) to
provide validity evidence for one of the eight measures. The Preschool Child Observation
Record, Second Edition (COR-2), was selected because it is the second most widely used
assessment in Head Start (Aikens et al., 2010), it corresponds to many states' early learning
standards (Epstein, 2006; Epstein & Schweinhart, 2009), and it was used in one of the nation's
largest and poorest school districts, which provides a nationally important context within which
to examine this measure.

The COR was developed by the HighScope Educational Research Foundation to measure
the learning and development of children 2.5 to 6 years old (HighScope, 1992). According to
the developers, the COR may be used to monitor the progress of individual children and
groups of children, to inform curriculum planning and instruction, and to assess the
effectiveness of classrooms or programs (HighScope, 2010). In 1993, the first commercially
available edition of the COR (COR-1) was released. It contained 30 items organized into
six categories of development: (a) Initiative, (b) Social Relations, (c) Creative
Representation, (d) Movement and Music, (e) Language and Literacy, and (f) Logic and Mathematics
(HighScope, 1992). These categories corresponded to the five major domains of
development (i.e., cognitive, language, physical, social-emotional, and approaches toward learning)
that are nationally recognized as important for school readiness (NRC, 2008; Office of Head
Start, 2010). Each item presented a continuum of five skill points for observers to rate
children's level of skill development (HighScope, 1992). Observers used the categories,
items, and skill points to classify and rate their observations of children's functioning
(HighScope, 1992).

In 2003, HighScope released the second edition of the COR (COR-2), which differed from
the COR-1 in several important ways. First, for every item the lowest skill point was changed
from referring to a child "not yet demonstrating" a skill to exhibiting a "basic exploration" into
a skill (HighScope, 2003). Second, COR-1 items were edited to reflect current literature on these
key areas of development. New items were added to the Language and Literacy category and
Logic and Mathematics category, which was renamed "Mathematics and Science" to reflect
the changes (Neill, 2004). The final version of the COR-2 contained 32 items organized into
the revised six categories (HighScope, 2010).

A comprehensive search for research on the COR-1 and COR-2 yielded one HighScope
report and two published investigations of the psychometric properties of the COR-1 and one
HighScope report on the COR-2. No detailed technical documentation on the content validity
of the COR-2 was located. In contrast, some validity evidence for the response process was
found for the COR-1 and COR-2. HighScope offers a recommended training program to provide
instruction, practice, and feedback to support the accurate use of the measure. However, only
limited empirical evidence was found on the consistency of the assessment process. For the
COR-1, Epstein (1993) found high interobserver agreement with observers who received


¹The eight assessments selected were the (a) Creative Curriculum Developmental Assessment; (b) Galileo Preschool
Assessment Scales; (c) HighScope Child Observation Record; (d) Learning Accomplishment Profile, Third Edition;
(e) Learning Accomplishment Profile, Diagnostic; (f) Learning Accomplishment Profile, Diagnostic, Spanish Edition;
(g) Mullen Scales of Early Learning; and (h) Work Sampling System (Halle et al., 2011).
intensive training, whereas Schweinhart, McNair, Barnes, and Larner (1993) found moderate
agreement with a sample of Head Start teachers who received less training. For the COR-2,
moderate interobserver agreement, ranging from .69 to .79, has been found (HighScope, 2010).

The Standards notes that the type of analyses used to bring evidence to bear on the internal
structure of an assessment depends on the intended use and interpretation of the assessment's
scores (AERA, APA, & NCME, 1999). Key to the use and interpretation of the COR-1 and
COR-2 is their ability to capture key school readiness domains and measure progress in each.
Two studies examined the dimensionality of the COR-1 and one HighScope report examined this
for the COR-2. Schweinhart et al. (1993) examined the psychometric properties of the COR-1
using 50 pairs of trained teaching teams to collect data on 484 children. A confirmatory factor
analysis (CFA) indicated that the six categories did not fit the data well (goodness-of-fit index
[GFI] = .79; GFI > .90 indicates a reasonable fit; Kline, 2005) and that the factors were highly
correlated (range = .71-.86), suggesting that some may be redundant (Brown, 2006). Fantuzzo,
Hightower, Grim, and Montes (2002) studied the validity of the COR-1 using data from 733
children enrolled in Head Start and 1,356 children from other preschool programs. Using exploratory
factor analysis (EFA) and confirmatory cluster analysis, Fantuzzo et al. (2002) determined that a
three-factor structure fit both samples (Cognitive Skills, Social Engagement, and Coordinated
Movement). To date, the psychometric quality of the COR-2 has only been examined by
HighScope (2010). In this study, Head Start teachers administered the COR-2 to 160 children in the
spring and to 233 different children in the fall (HighScope, 2010). Based on principal component
analysis, a complex four-component structure with several items loading on multiple components
was advanced.

Unfortunately, all three investigations of the dimensionality of the COR-1 and COR-2 ignored
the categorical nature of the data and instead improperly treated them as continuous. Using
standard factor analysis with categorical data is inappropriate because it violates foundational
assumptions of the model and misrepresents the data (Flora & Curran, 2004). Furthermore,
treating categorical data as continuous in factor analysis can yield a structure that distorts
the underlying constructs and does not replicate across samples (Bernstein & Teng, 1989;
McDermott et al., 2011). McDermott et al. (2011) noted that such results are nontrivial, as they
promote interpretation and decision making based on spurious constructs. Advances have been
made in psychometric science to address these issues, such as using polychoric correlations for
factor analysis of categorical data (McDermott et al., 2011). A further issue with the examination
of the dimensionality of the COR-2 is that it relied on principal component analysis. Snook and
Gorsuch (1989) found that principal component analysis yields inflated component loadings and
standard errors when there are fewer than approximately 40 items, which is the case for the COR-2.
Finally, the examination of the COR-2 used a sample of approximately 200 children, which is
half the size recommended by Gorsuch (2003) to ensure a viable structure.

The internal structure with respect to item functioning has also been examined for the COR-1.
Fantuzzo et al. (2002) evaluated the five skill points of each COR-1 item to determine whether
they represented a valid hierarchical developmental sequence. Using descriptive statistics, the
researchers found that more than one third of the items had irregular distributions, suggesting
that these items' skill points do not represent a progressive ability sequence. Currently, no
published research has examined this important aspect of the skill points of the COR-2 items.
Finally, studies of the COR-1 and COR-2 have investigated their validity based on relationships
to other variables (Fantuzzo et al., 2002; Schweinhart et al., 1993; Sekino & Fantuzzo, 2005).
However, this research is not reviewed here given the shortcomings of the studies on the
dimensionality and item skill points of the COR-1 and COR-2.

Despite its widespread use, there is no published, peer-reviewed research on the psychometric
quality of the COR-2. The purpose of the present study was to start to bridge this knowledge gap
by investigating the validity of the COR-2 for children attending preschool in the context of
urban poverty. The dimensionality of the COR-2 was rigorously evaluated using a series of
factor analyses. This process included both a CFA of the six-factor structure posited by the
developers as well as a multi-step exploratory and confirmatory examination of the internal
structure. In addition, IRT methods were used to examine the five skill points of each item. This
information was used to determine the extent to which the skill points correspond to a
hierarchical developmental sequence.


METHOD 


Participants

This study analyzed data from a larger study of a comprehensive early childhood educational
program. The program consists of an evidence-based curriculum, a strong partnership with
families, an evidence-based formative assessment, and professional development for teachers
(Fantuzzo, Gadsden, & McDermott, 2010). Data for the present study included all children
who were in a large, urban school district's Head Start program in 2006-2007. The analysis
sample consisted of 4,071 children with COR-2 data for the fall, winter, and spring.
Approximately half of the children in the sample were male, 5% were Caucasian, 70% were African
American, 18% were Latino, 3% were Asian, 4% were other, and 5% were English language
learners. The average age of the children was approximately 3.5 years old.

The school district in this study requires Head Start teachers to have a bachelor's degree and
certification in early childhood education. The district also required the use of the COR-2 at the
time to monitor children's learning and development in areas important for school readiness (the
COR-1 was used before the COR-2). The COR-2 was completed three times each year so that it
could inform lesson planning and provide information on the extent to which objectives and
standards were being met. A subset of teachers was recruited to attend the 2-day training on
the COR-2 recommended by HighScope, and they trained the other teachers at their school
who had not attended the training (Waterman, McDermott, Fantuzzo, & Gadsden, 2012).
Informal follow-up training was provided in subsequent professional development sessions
(Waterman et al., 2012).


Measures 

COR-2. The COR-2 is a widely used early childhood assessment of multiple domains of
functioning important for school readiness (HighScope, 2010). It consists of 32 categorical items
organized into six categories: Initiative, Social Relations, Creative Representation, Movement
and Music, Language and Literacy, and Mathematics and Science (HighScope, 2010). Each
of the 32 items contains five skill points ranging from less (1) to more (5) developed ability.
Teachers record the highest skill point applicable for each item at a particular time. Skill points
are then averaged across all items within each category to create category scores and across all
categories to create a total score.
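
The scoring rule just described can be illustrated with a minimal Python sketch. The item letters follow the labels used in Table 1; the function name `score_cor2` and the example ratings are hypothetical and are not part of the COR-2 materials.

```python
from statistics import mean

# Item letters per category, following the item labels in Table 1.
CATEGORIES = {
    "Initiative": ["A", "B", "C", "D"],
    "Social Relations": ["E", "F", "G", "H"],
    "Creative Representation": ["I", "J", "K"],
    "Movement and Music": ["L", "M", "N", "O", "P"],
    "Language and Literacy": ["Q", "R", "S", "T", "U", "V", "W", "X"],
    "Mathematics and Science": ["Y", "Z", "AA", "BB", "CC", "DD", "EE", "FF"],
}


def score_cor2(ratings):
    """Return (category_scores, total_score) from a dict of item -> skill point (1-5)."""
    # Category score: mean skill point over the items in that category.
    category_scores = {
        name: mean(ratings[item] for item in items)
        for name, items in CATEGORIES.items()
    }
    # Total score: mean of the six category scores (not of the 32 items).
    total_score = mean(category_scores.values())
    return category_scores, total_score


# Hypothetical child: skill point 3 on every item except a 5 on item A.
example = {item: 3 for items in CATEGORIES.values() for item in items}
example["A"] = 5
category_scores, total_score = score_cor2(example)
```

Because the total is an average of category averages, items in small categories (e.g., the three Creative Representation items) carry more weight in the total than items in large categories.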

Procedure 

Testing the fit of the six COR-2 categories. A CFA of the purported COR-2 structure
was performed using data from the fall for the 4,071 children with COR-2 data from the fall,
winter, and spring. The goal of this analysis was solely to test the fit of the six COR-2 categories
to the data. Gorsuch (2003) recommended using a sample of at least 400 to ensure stable
correlations and a viable structure. The present study's sample was well in excess of this sample size
guideline. Prior to performing the CFA, we performed an item analysis to look for items with
severely constricted variance, to look for items with floor and ceiling effects, and to screen
for potential data entry errors.

The raw data were used to estimate a categorical CFA model using Mplus 6.1 (Muthén &
Muthén, 2010). Mean and variance adjusted robust weighted least squares was used for
estimation. Flora and Curran (2004) showed that this method yields robust estimates with varying
sample sizes, degrees of nonnormality, and levels of model complexity. Per Brown (2006),
model fit was evaluated based on (a) several global goodness-of-fit indices, (b) the modification
indices and completely standardized expected parameter change (SEPC) estimates to identify
specific areas of model misfit, and (c) the practical and statistical interpretability of the model
and its parameters. This three-pronged approach avoided the common mistake of solely using
goodness-of-fit indices to evaluate fit (Brown, 2006). Specifically, "the other two aspects of
fit evaluation (specific areas of model misfit, parameter estimates) provide more specific
information about the acceptability and utility of the solution" (Brown, 2006, p. 113).

The global goodness-of-fit indices that were used as part of the model evaluation were the
comparative fit index (CFI), the root mean square error of approximation (RMSEA), and the
weighted root-mean-square residual (WRMR). Yu (2002) found that the WRMR performs well
with nonnormal data and model misspecifications. The chi-square test was reported as well, but
it was not relied on for decision making because of its sensitivity to large sample sizes (Brown,
2006). Simulation studies have found that good fit is characterized by CFI > .950-.960,
RMSEA < .050-.060, and WRMR < .950-1.000 (Hu & Bentler, 1999; Yu, 2002).

Global goodness-of-fit indices may indicate an adequate fit even when some observed
relationships are not accounted for sufficiently (Brown, 2006). To identify areas of misfit, we
inspected modification indices and SEPC estimates. Modification indices estimate the expected
change in the chi-square statistic if a parameter not freely estimated in the model is freed
(Brown, 2006). Given the sensitivity of chi-square to sample size, modification indices were
considered in conjunction with SEPC estimates. SEPC estimates indicate the expected change
in a parameter if it is freely estimated and therefore suggest whether respecification will lead
to a statistically and practically meaningful improvement (Brown, 2006). The results were
inspected for SEPC values that suggested an item would load saliently (>.40) on a factor other
than the one it was designated to load on a priori. Finally, the practical and statistical
interpretability of the model (including the model specification; the size, significance, and direction of
the factor loadings; and interfactor correlations) was used to evaluate fit (Brown, 2006). Robust
model specification is indicated by having factors with four or more saliently loading items
(Gorsuch, 2003).
To summarize, the fit of the six-factor CFA model was evaluated using the following criteria:
(a) CFI > .950-.960, RMSEA < .050-.060, WRMR < .950-1.000 (Hu & Bentler, 1999; Yu,
2002); (b) modification indices and SEPC estimates indicating areas of model misfit (Brown,
2006); and (c) the practical and statistical interpretability of the model (e.g., robust model
specification, significant loadings, and reasonable interfactor correlations; Brown, 2006).
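
As a rough illustration of criterion (a) only, the sketch below flags each global index against the more lenient end of the cutoff ranges cited above. The function name and the index values passed to it are hypothetical; criteria (b) and (c) require inspection of the model output and judgment, so they are not reduced to code here.

```python
def evaluate_global_fit(cfi, rmsea, wrmr):
    """Flag each index against the lenient end of the cutoff ranges cited in the text."""
    return {
        "CFI": cfi > 0.950,      # larger is better
        "RMSEA": rmsea < 0.060,  # smaller is better
        "WRMR": wrmr < 1.000,    # smaller is better
    }


# Hypothetical index values, for illustration only.
flags = evaluate_global_fit(cfi=0.97, rmsea=0.07, wrmr=1.8)
all_met = all(flags.values())
```

A model can pass one index and fail the others, which is one reason the global indices are read alongside the localized misfit and interpretability criteria.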

EFAs. The sample was randomly divided into two subsamples of approximately 2,035
children: (a) an exploratory sample for the EFAs and (b) a reserve sample for the confirmatory
analyses of the factor structure derived from the EFAs. This two-step approach is recommended
by factor analysis experts (e.g., Fabrigar, Wegener, MacCallum, & Strahan, 1999) because the
optimal factor structure is empirically uncovered using EFA, and then its fit to the data is
cross-validated with CFA. Such an approach is especially appropriate when there is no or limited
prior empirical and theoretical evidence to support using CFA (Brown & Moore, 2012).
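
A random split of this kind can be sketched in a few lines of Python. The seed, the function name, and the use of integer identifiers are assumptions made for illustration; the source does not describe how the split was carried out in software.

```python
import random


def split_sample(child_ids, seed=0):
    """Shuffle the identifiers and return (exploratory, reserve) halves."""
    ids = list(child_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical identifiers 0..4070 standing in for the 4,071 children.
exploratory, reserve = split_sample(range(4071))
```

With an odd sample size the halves differ by one case, which is consistent with "approximately 2,035" per subsample.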

Two-stage maximum likelihood estimation was used to calculate a polychoric correlation
matrix in MicroFACT 2.0 for the exploratory sample (Waller, 2001). Per Knol and Berger
(1988), the matrix was smoothed to reduce the number of Heywood cases (i.e., communalities
>1) and to ensure positive semidefiniteness. To determine the initial number of factors to
extract, we performed minimum average partialing (MAP) on the smoothed matrix (Velicer,
1976). The matrix was then used for iterative common factoring in SAS 9.3 using squared
multiple correlations as initial communality estimates (McDermott et al., 2011). The analyses
used varimax, equamax, and promax rotational procedures. The optimal structure was the one
that met the criteria used by McDermott et al. (2011) for this type of analysis: (a) maximizes
hyperplane count and item coverage indicating an approximation of simple structure (Yates,
1987), (b) produces the smallest root-mean-square residual (RMSR) and largest GFI (Waller,
2001), (c) has at least four salient items (loadings > .40) per factor, (d) results in internally
consistent factors (r > .70), and (e) yields an uncomplicated structure aligned with theory and
research (Fabrigar et al., 1999).
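
To make the MAP step concrete, the following is a simplified NumPy sketch of the minimum average partial idea: principal components are partialed out of the correlation matrix one at a time, and the number of components that minimizes the average squared partial correlation is retained. This is an illustrative implementation under stated assumptions, not the MicroFACT or SAS routine used in the study, and the toy correlation matrix is hypothetical.

```python
import numpy as np


def velicer_map(R):
    """Return (suggested number of factors, list of average squared partial correlations)."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    off_diag = ~np.eye(p, dtype=bool)

    avg_sq = [np.mean(R[off_diag] ** 2)]       # m = 0: nothing partialed out
    for m in range(1, p - 1):
        A = loadings[:, :m]
        C = R - A @ A.T                        # residual covariance after m components
        d = np.diag(C)
        if np.any(d <= 1e-12):                 # residual variance exhausted; stop early
            break
        partial = C / np.sqrt(np.outer(d, d))  # rescale residuals to partial correlations
        avg_sq.append(np.mean(partial[off_diag] ** 2))
    return int(np.argmin(avg_sq)), avg_sq


# Hypothetical 8-variable matrix: two blocks of four variables,
# r = .60 within a block and r = .20 between blocks.
block = np.full((4, 4), 0.6)
cross = np.full((4, 4), 0.2)
R_toy = np.block([[block, cross], [cross, block]])
np.fill_diagonal(R_toy, 1.0)
n_factors, map_values = velicer_map(R_toy)
```

For this toy matrix the average squared partial correlation is smallest after two components are removed, matching its two-block design. With real polychoric matrices the input should already be smoothed, as described above, so that the eigenvalues are nonnegative.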

CFAs. To confirm the optimal structure derived from the exploratory analyses, we
submitted the reserve sample data to categorical CFA using the procedures specified previously. In
addition, a higher order model was fit to provide a test for a general developmental status factor
commonly found with many performance assessments (Gorsuch, 2003). The Schmid-Leiman
transformation (Schmid & Leiman, 1957) is helpful in understanding higher order models by
producing orthogonal first- and second-order factors and estimates of their unique contributions
to explaining variance (Watkins, Wilson, Kotz, Carbone, & Babula, 2006). Per Brown (2006),
an orthogonalized (i.e., applying the Schmid-Leiman transformation) higher order model was
estimated by specifying the number of second- and first-order factors suggested by theoretical
and empirical evidence. Item loadings on the second-order factor were estimated by extracting
the variance this factor explained (Brown, 2006). The first-order factor loadings were
residualized of the variance explained by the second-order factor, leaving only the variance uniquely
accounted for by the first-order factors (Watkins et al., 2006).
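
For a single item that loads on exactly one first-order factor, the Schmid-Leiman arithmetic reduces to two products, sketched below. The loading values are hypothetical, and the sketch assumes a simple two-level model with standardized loadings rather than reproducing the study's estimation.

```python
import math


def schmid_leiman(item_loading, first_order_on_second_order):
    """Split an item's first-order loading into general and residualized parts.

    item_loading: standardized loading of the item on its first-order factor.
    first_order_on_second_order: standardized loading of that first-order
        factor on the second-order (general) factor.
    """
    general = item_loading * first_order_on_second_order
    residualized = item_loading * math.sqrt(1.0 - first_order_on_second_order ** 2)
    return general, residualized


# Hypothetical values: item loads .80 on its factor, which loads .90 on the general factor.
general, residualized = schmid_leiman(0.80, 0.90)
```

The squared general and residualized loadings sum to the squared original loading, so the transformation repartitions the item's explained variance without changing its total.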

Developmental sequence. The data were analyzed to understand the functioning of the
skill points of each item in terms of the ordering and spacing of the estimated parameters. To
do this, we estimated thresholds corresponding to the levels of the latent trait that separated
two adjacent skill points (Andrich, 2010). These parameters should increase such that the
threshold for the boundary between Skill Points 1 and 2 is less than the threshold for Skill Points 2
and 3 (Bond & Fox, 2007). Thresholds that did not progress in this manner were identified as
disordered or reversed (Bond & Fox, 2007). Thresholds were also inspected for reasonable
spacing, which suggested that each skill point identified a unique level of the latent trait (Bond &
Fox, 2007).
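
The ordering and spacing checks can be expressed as a short routine. The threshold values and the `min_gap` cutoff below are hypothetical; the text does not give a numeric rule for "reasonable spacing," so the cutoff is an assumption introduced only for illustration.

```python
def check_thresholds(thresholds, min_gap=0.5):
    """Return (reversed_pairs, narrow_pairs) as 0-based index pairs of adjacent thresholds."""
    reversed_pairs, narrow_pairs = [], []
    for k in range(len(thresholds) - 1):
        gap = thresholds[k + 1] - thresholds[k]
        if gap <= 0:
            reversed_pairs.append((k, k + 1))   # later threshold is not higher: reversal
        elif gap < min_gap:
            narrow_pairs.append((k, k + 1))     # ordered, but closer than the assumed cutoff
    return reversed_pairs, narrow_pairs


# Four thresholds separate the five skill points of an item (hypothetical values).
ordered = check_thresholds([-1.5, -0.4, 0.6, 1.7])
disordered = check_thresholds([-1.5, 0.3, 0.1, 1.7])
```

In the second example the threshold between Skill Points 3 and 4 falls below the one between Skill Points 2 and 3, which is the pattern the text labels a reversal.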

Threshold reversals have several potential causes, including incorrect skill point ordering,
multidimensional responses, and skill points with low frequencies (Adams, Wu, & Wilson,
2012; Andrich, de Jong, & Sheridan, 1997). Given the multiple potential causes of reversals,
their interpretation is debated. Some experts argue that threshold reversals indicate "clear and
unambiguous evidence of problems in the empirical ordering" of the skill points (Andrich
et al., 1997, p. 70). Others argue that reversals do not necessarily point to a problem with the
skill point ordering but agree that they indicate that an item is malfunctioning (Adams et al.,
2012). Neither viewpoint recommends using remedial measures such as collapsing
categories; both instead suggest that items with threshold reversals need to be reviewed and revised
(Adams et al., 2012; Bond & Fox, 2007).

In the present study, IRT methods were used to estimate the thresholds for each item. Both the
partial credit model (PCM) and the generalized PCM (GPCM) were estimated using
PARSCALE 4.1. To determine which model produced a better fit to the data, we compared
the chi-squares and estimated internal consistency and information for each scale (du Toit,
2003; Embretson & Reise, 2000). Results from the selected model were used to detect reversed
and potentially poorly spaced thresholds.
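
For readers unfamiliar with these models, the sketch below computes skill point probabilities under a textbook form of the GPCM, in which the probability of each skill point is proportional to the exponential of a cumulative sum of slope times (trait level minus threshold). Setting the slope to a common value across items gives the PCM. The parameter values are hypothetical, and the sketch omits the scaling constant and the location/category parameterization that specific software such as PARSCALE may use.

```python
import math


def gpcm_probabilities(theta, thresholds, slope=1.0):
    """Return the probability of each skill point (lowest first) at trait level theta."""
    cumulative = [0.0]                       # the lowest skill point has an empty sum
    for b in thresholds:
        cumulative.append(cumulative[-1] + slope * (theta - b))
    peak = max(cumulative)                   # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with evenly spaced, ordered thresholds, evaluated at theta = 0.
probs = gpcm_probabilities(theta=0.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
```

With ordered thresholds, each skill point is the most probable response over some interval of the trait; at theta = 0 in this example the middle skill point is most likely.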


RESULTS 


Item Analysis 

On average, item responses were not skewed (M skewness = .05, range = -.40 to .73), but they
were platykurtic (M kurtosis = -.94, range = -1.42 to -0.36). Each of the skill points of every
item was used, although some infrequently. Floor and ceiling effects were investigated by
examining the occurrence of extreme item response patterns (e.g., 1 or 5 on every item). About 1% of
the children had extreme response patterns, with most receiving a 1 on every item. Covariate
information was used to determine the likelihood that these patterns were due to data recording
errors. In general, this information suggested that these responses were genuine (e.g., those who
received all fives were typically older and did well on a validated measure of early academic
achievement). Furthermore, there were only a few cases of extreme response patterns, and thus
they were not removed from the data.
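
The screening for extreme response patterns amounts to checking whether a child received the same lowest or highest skill point on every item. A minimal sketch, using hypothetical ratings and a hypothetical function name:

```python
def extreme_pattern(ratings, low=1, high=5):
    """Return 'floor', 'ceiling', or None for one child's list of item ratings."""
    values = set(ratings)
    if values == {low}:
        return "floor"     # lowest skill point on every item
    if values == {high}:
        return "ceiling"   # highest skill point on every item
    return None


# Three hypothetical children rated on the 32 items.
children = {
    "child_1": [1] * 32,
    "child_2": [5] * 32,
    "child_3": [3] * 31 + [4],
}
flags = {cid: extreme_pattern(r) for cid, r in children.items()}
share_extreme = sum(f is not None for f in flags.values()) / len(flags)
```

Flagged cases would then be checked against covariate information, as described above, before deciding whether to retain them.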


Testing the Fit of the Six COR-2 Categories

The CFA model specification of the six COR-2 categories is displayed in Figure 1. This
specification indicated no double-loading items, uncorrelated measurement errors, correlated factors,
and an overidentified model with df = 449. Data from the fall for the 4,071 children with COR-2
data at all three time points were used to estimate the model (see Table 1).
FIGURE 1 Confirmatory analysis model of six categories of the Preschool Child Observation Record, Second
Edition. Note that this figure provides a simplified presentation and does not show all of the items (as indicated by
the ellipses) or model parameters. Please refer to Table 1 to see the number and nature of items on each factor. Categories
are from HighScope (2010).

For the six-factor model, the CFI was .978, suggesting a good fit, but at .066 and 2.602 the
RMSEA and WRMR were higher than the criteria set for a good-fitting model (χ²(449) =
8,498.46, p < .00001). In addition, Brown (2006) noted that the CFI is more likely to suggest
an acceptable fit than other indices because it compares the fit of the theoretical model to the
fit of a model in which the items are unrelated. The modification indices, SEPC estimates,
interfactor correlations, and factor specifications pointed to problems with this structure. The
modification indices and SEPC estimates suggested that many items on the Social Relations,
Initiative, and Creative Representation factors would load saliently (>.40) on more than one
of these factors. This indicated that these three factors might have been better represented by
one factor, a hypothesis supported by the high correlations between these factors (M r = .96).
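
As a plausibility check on the reported RMSEA, the conventional single-group formula can be applied to the chi-square reported above (read here as 8,498.46 with df = 449 and N = 4,071). This is a sketch under that assumption; the exact computation used by Mplus for the robust estimator may differ in detail.

```python
import math


def rmsea(chi_square, df, n):
    """Conventional RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (N - 1)))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


value = rmsea(chi_square=8498.46, df=449, n=4071)
rounded = round(value, 3)
```

Rounded to three decimals the result is .066, which agrees with the value reported in the text and exceeds the .050-.060 cutoff range.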

As shown in Table 2, all six factors were highly correlated (M r = .91, range = .85-.97).
A correlation of .91 indicates that only 17% of a factor's variance is unique, whereas 83% is


















PRESCHOOL CHILD OBSERVATION RECORD, SECOND EDITION 1127


TABLE 1

Confirmatory Factor Analysis Results for the Six Categories of the Preschool Child Observation Record,
Second Edition (N = 4,071)

                                                      Unstandardized        Standardized
Item                                                Estimate      SE    Estimate      SE

1. Initiative
  A. Making choices and plans                              1       0        0.85    0.01
  B. Solving problems with materials                    0.97    0.01        0.82    0.01
  C. Initiating play                                    1.02    0.01        0.86    0.01
  D. Taking care of personal needs                      0.94    0.01        0.80    0.01
2. Social Relations
  E. Relating to adults                                    1       0        0.86    0.01
  F. Relating to other children                         0.97    0.01        0.84    0.01
  G. Resolving interpersonal conflict                   0.94    0.01        0.80    0.01
  H. Understanding and expressing feelings              1.00    0.01        0.86    0.01
3. Creative Representation
  I. Making and building models                            1       0        0.86    0.01
  J. Drawing and painting pictures                      0.98    0.01        0.84    0.01
  K. Pretending                                         1.00    0.01        0.86    0.01
4. Movement and Music
  L. Moving in various ways                                1       0        0.84    0.01
  M. Moving with objects                                0.94    0.01        0.79    0.01
  N. Feeling and expressing steady beat                 1.03    0.01        0.86    0.01
  O. Moving to music                                    1.01    0.01        0.85    0.01
  P. Singing                                            1.00    0.01        0.85    0.01
5. Language and Literacy
  Q. Listening to and understanding speech                 1       0        0.89    0.00
  R. Using vocabulary                                   0.97    0.01        0.87    0.01
  S. Using complex patterns of speech                   0.98    0.01        0.87    0.01
  T. Showing awareness of sounds in words               0.96    0.01        0.86    0.01
  U. Demonstrating knowledge about books                0.95    0.01        0.85    0.01
  V. Using letter names and sounds                      0.90    0.01        0.80    0.01
  W. Reading                                            0.89    0.01        0.80    0.01
  X. Writing                                            0.92    0.01        0.82    0.01
6. Mathematics and Science
  Y. Sorting objects                                       1       0        0.87    0.01
  Z. Identifying patterns                               0.97    0.01        0.84    0.01
  AA. Comparing properties                              1.02    0.01        0.89    0.00
  BB. Counting                                          0.95    0.01        0.83    0.01
  CC. Identifying position and direction                0.98    0.01        0.86    0.01
  DD. Identifying sequence, change, and causality       1.03    0.01        0.90    0.00
  EE. Identifying materials and properties              0.98    0.01        0.86    0.01
  FF. Identifying natural and living things             0.95    0.01        0.83    0.01

Note. In the unstandardized model, one indicator per factor was fixed to 1 to define the factor metric (Brown, 2006).
Categories and item titles are from HighScope (2010).


redundant. Many researchers have noted that high intcrfactor correlations (c.g.. greater than 
.80-.85) provide "strong evidence to question" the existence of distinct constructs (Brown & 
Moore. 2012. p. 373). Williams. Ford, and Nguyen (2002) noted that "most researchers would 
not want to attempt the argument that two factors with such a high correlation [referring to a 







1128 BARG1IAUS AND PANTOZZO 


TABLE 2 

IrteftncSor Correlatons Icy the Sc. Categories ot the Presetted Chid Observation Record. Second Edition 


C^gory 

R|P 

2 

s 

4 

5 

6 

1. Imamvc 


.97 

.96 


.93 

.90 

2. Social Kclaoam 



.9i 

JS6 

.92 


3. Cream* Representation 





.93 

.91 

4. Movement and Muik 






M 

$. Language and Literacy 






.95 

6. Mithrmaocs and Science 






— 


Sort. CucgonM ik bom HighScope (2010). 


conclalion of .86] arc meaningfully different" (p. 373). Thus, the high intcifacloi correlations 
suggested that a model with fewer factors might fit better. 
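
The unique and redundant percentages quoted for a correlation of .91 follow from squaring the correlation. The short sketch below only illustrates that arithmetic; the function name `shared_variance` is introduced here for illustration and is not part of the study's analysis.

```python
def shared_variance(r: float) -> tuple[float, float]:
    """Return the (shared, unique) proportions of variance implied by a correlation r."""
    shared = r ** 2          # proportion of variance two factors have in common
    return shared, 1.0 - shared


# Correlations mentioned in the text: the six-factor mean, the Williams et al.
# example, and the mean among the first three categories.
for r in (0.91, 0.86, 0.96):
    shared, unique = shared_variance(r)
    print(f"r = {r:.2f}: shared = {shared:.0%}, unique = {unique:.0%}")
```

For r = .91 the squared correlation is .8281, which rounds to the 83% redundant and 17% unique figures given above.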

Finally, the six-factor model had one factor with just three items. This specification is
problematic because experts have noted that to robustly define a dimension four or more indicators
are needed (Gorsuch, 2003). In sum, the WRMR, changes suggested by the modification indices and
SEPC estimates, high interfactor correlations, and model specification problems all supported
the conclusion that the six-factor model did not fit the data well. Using CFA may have been
premature, as it relies on a robust theoretical and empirical foundation to determine the
appropriate number of factors (Brown & Moore, 2012, p. 373). Thus, EFA and CFA were used to
empirically determine the optimal factor structure of the COR-2.

EFAs 

The fall data on the 4,071 children were randomly divided into an exploratory (n = 2,036) and a
confirmatory (n = 2,035) sample. Two-stage maximum likelihood estimation was used to
calculate a smoothed polychoric correlation matrix for the exploratory sample (Knol & Berger, 1988;
Waller, 2001). The smoothed correlation matrix was submitted to MAP, which suggested four
potentially viable factors, and therefore two- through six-factor solutions were evaluated.
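
The random division into exploratory and confirmatory halves can be sketched as follows. This is a minimal illustration, not the authors' procedure: the seed, the function name `split_sample`, and the use of integer child identifiers are all assumptions made here.

```python
import random


def split_sample(ids, seed=0):
    """Randomly split identifiers into an exploratory and a confirmatory half.

    The seed is a hypothetical choice that only makes the sketch reproducible.
    When the total is odd, the extra case goes to the exploratory half.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2
    return shuffled[:half], shuffled[half:]


# 4,071 children yield halves of 2,036 and 2,035, matching the sample sizes in the text.
exploratory, confirmatory = split_sample(range(4071))
print(len(exploratory), len(confirmatory))
```

Fitting the exploratory model on one half and testing it on the other keeps the confirmatory test from simply re-describing the data that produced the model.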

The four-factor promax (k = 4) model with initial equamax rotation was determined to be the
optimal solution, as it met all of the evaluation criteria. The six- and five-factor models produced
factors with too few salient items, and the six-factor model did not reproduce the
developer-defined categories. Solutions with fewer than four factors generally reproduced the factors in
the four-factor model in a more complex manner (e.g., more items loading on two factors, less
coverage of the items, lower hyperplane count). The four-factor model jointly maximized
hyperplane count and item coverage, providing the best approximation of simple structure. The fit
indices suggested that this model provided a good fit to the data (GFI = .9995 and RMSR = .02).
The four-factor model retained five or more salient items per factor, and all of the factors were
internally consistent (range = .89-.96). A total of 31 of the 32 items were salient, with only Item
U, "Demonstrating knowledge about books," not loading saliently on any factor. Item R,
"Using vocabulary," loaded on two factors and was removed from both per Comrey's
(1988) recommendation, resulting in 30 items being retained.
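
The item retention rule described above (keep items that are salient on exactly one factor) can be expressed as a small sketch. The cutoff is treated as "at least .40" on the rounded loadings, which is an assumption made so that Item R's reported loadings of .41 and .40 both count as salient, as the text states. The loadings for Items CC and R are taken from Table 3; the third item is hypothetical and included only to show the no-salient-loading case.

```python
SALIENT_CUTOFF = 0.40  # assumed to apply to loadings rounded to two decimals


def classify(loadings, cutoff=SALIENT_CUTOFF):
    """Classify an item from its pattern loadings on factors 1..k."""
    salient = [f for f, value in enumerate(loadings, start=1) if abs(value) >= cutoff]
    if len(salient) == 1:
        return ("retain", salient)
    if not salient:
        return ("drop: no salient loading", salient)
    return ("drop: salient on several factors", salient)


items = {
    "CC. Identifying position and direction": [0.12, 0.08, 0.01, 0.68],
    "R. Using vocabulary": [0.41, 0.14, -0.03, 0.40],
    "hypothetical item": [0.30, 0.28, 0.10, 0.21],  # illustrative values only
}

for title, loadings in items.items():
    print(title, classify(loadings))
```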

Table 3 provides the final factors and their component items and pattern loadings. These
results indicated that the four-factor model met the final EFA evaluation criteria of making
theoretical sense. Based on the loadings, the factors were named Social Engagement (e.g.,
"Relating to other children"), Cognitive Skills (e.g., "Counting"), Coordinated Movement
(e.g., "Moving to music"), and Scientific Process Skills (e.g., "Identifying natural and living
things").

TABLE 3
Equamax Promax (k = 4) Rotated Factor Pattern Loadings for the Preschool Child Observation
Record, Second Edition

Item                                               1          2       3       4
1. Social Engagement
  E. Relating to adults                            [illeg.]   -.09    .00     .14
  F. Relating to other children                    [illeg.]   .02     .03     [illeg.]
  A. Making choices and plans                      .79        .07     -.10    .11
  G. Resolving interpersonal conflict              [illeg.]   .09     .01     .11
  C. Initiating play                               [illeg.]   .13     .26     -.09
  S. Using complex patterns of speech              [illeg.]   -.05    -.03    .39
  B. Solving problems with materials               [illeg.]   .14     .17     -.05
  K. Pretending                                    [illeg.]   .08     .27     [illeg.]
  H. Understanding and expressing feelings         [illeg.]   -.06    .20     .16
  D. Taking care of personal needs                 [illeg.]   .11     .29     [illeg.]
  Q. Listening to and understanding speech         [illeg.]   .06     .07     [illeg.]
  I. Making and building models                    [illeg.]   .16     .23     .10
2. Cognitive Skills
  V. Using letter names and sounds                 [illeg.]   .98     -.14    .09
  X. Writing                                       [illeg.]   .89     -.02    [illeg.]
  BB. Counting                                     [illeg.]   .66     .08     .16
  Y. Sorting objects                               [illeg.]   .52     .11     [illeg.]
  W. Reading                                       [illeg.]   .50     .05     .27
  Z. Identifying patterns                          [illeg.]   .48     .10     [illeg.]
  T. Showing awareness of sounds in words          .17        .47     .02     .22
  J. Drawing and painting pictures                 [illeg.]   .41     .28     -.08
3. Coordinated Movement
  N. Feeling and expressing steady beat            [illeg.]   .01     .85     [illeg.]
  O. Moving to music                               [illeg.]   -.02    .81     .08
  P. Singing                                       [illeg.]   -.05    .66     .19
  L. Moving in various ways                        .13        -.04    .64     .13
  M. Moving with objects                           [illeg.]   -.02    .48     .26
4. Scientific Process Skills
  EE. Identifying materials and properties         -.12       .10     .13     [illeg.]
  FF. Identifying natural and living things        [illeg.]   .01     .15     .79
  CC. Identifying position and direction           .12        .08     .01     .68
  DD. Identifying sequence, change, and causality  .15        .10     .02     .68
  AA. Comparing properties                         [illeg.]   .20     .11     [illeg.]
Items not included in final solution
  U. Demonstrating knowledge about books           [illeg.]   .28     .10     .21
  R. Using vocabulary                              .41        .14     -.03    .40

Note. Salient pattern loadings (>.40) are in bold. Item R loaded saliently on the Social Engagement
and Scientific Process Skills factors and was removed from both. Item titles are from
HighScope (2010).

CFAs

The confirmatory sample (n = 2,035) was used to test the fit of the four-factor, 30-item EFA
solution using CFA. The CFI (.980) indicated that this model provided a reasonable fit, but
the RMSEA (.062) was slightly higher than the established criteria, and the WRMR (1.817)
did not meet the criteria for good fit (χ²(399) = 3,521.57, p < .00001). Akaike's information
criterion (AIC) for the four-factor solution was less than the AIC for the six-factor solution
(estimated using the confirmatory sample), indicating that the four-factor model fit the data
better (2,239.43 vs. 2,956.95, respectively). The modification indices and SEPC estimates indicated
that some items would load saliently on another factor. However, fewer changes were suggested
for the four-factor model than for the six-factor model, again supporting the simpler model (8 vs.
19 suggestions, respectively, of salient cross-factor item loadings). Still, high interfactor
correlations (M r = .89, range = .83-.94; see Table 4) called into question the extent to which the four
factors represented distinct constructs.
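
As a rough plausibility check, the reported RMSEA values can be related to the chi-square statistics, degrees of freedom, and sample size with the common point-estimate formula. This is a sketch under stated assumptions: the study used a robust estimator for ordinal data, software differs on whether N or N - 1 appears in the denominator, and the function name `rmsea` is introduced here for illustration.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square, its df, and the sample size.

    Uses sqrt(max(chi2 - df, 0) / (df * (n - 1))); some programs use n instead of n - 1.
    """
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Four-factor, 30-item model fit in the confirmatory sample (n = 2,035).
print(round(rmsea(3521.57, 399, 2035), 3))
```

With these inputs the formula gives a value that rounds to .062, in line with the RMSEA reported for the four-factor model.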

An orthogonalized second-order factor analysis was also performed (Brown, 2006; Watkins
et al., 2006). The second-order factor loadings, residualized first-order loadings, variance
explained by the second- and first-order factors, and communalities are shown in Table 5. Every
COR-2 item loaded saliently on the second-order factor, whereas none of the residualized
first-order loadings were salient. Examining the variance explained by the second- and
first-order loadings revealed that the second-order factor largely accounted for the variance.
Overall, the second-order factor accounted for 64% of the total variance and 90% of the common
variance, whereas the first-order factors collectively accounted for only 7% and 10%,
respectively. The results suggested that a second-order factor may have accounted for the high
interfactor correlations. However, it should be noted that the second-order model did not meet all of
the established fit criteria, with a CFI of .98, an RMSEA of .065, and a WRMR of 1.94
(χ²(401) = 3,819.99, p < .00001). Still, collectively, these and the previous results raised
questions about the utility of the four-factor model.
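
The variance figures in this paragraph follow from squaring orthogonalized loadings and summing them. The sketch below illustrates the arithmetic with values read from Table 5 (Item W, "Reading," and the total-variance row); the function names are introduced here for illustration, and small discrepancies from the published percentages are expected because the table entries are rounded.

```python
def decompose(general_loading: float, first_order_loading: float):
    """Split an item's communality into general and first-order parts.

    In an orthogonalized (Schmid-Leiman) solution the factors are uncorrelated,
    so each squared loading is a proportion of item variance and they add up.
    """
    general = general_loading ** 2
    specific = first_order_loading ** 2
    return general, specific, general + specific


def share_of_common(pct_total: float, pct_total_all: float) -> float:
    """Express a factor's percentage of total variance as a percentage of common variance."""
    return 100.0 * pct_total / pct_total_all


# Item W: general loading .76, residualized Cognitive Skills loading .25 (as read from Table 5).
general, specific, communality = decompose(0.76, 0.25)
print(round(general, 2), round(communality, 2))

# Totals: the general factor explains 63.8% of total variance; all factors together 70.5%.
print(round(share_of_common(63.8, 70.5), 1))
print(round(share_of_common(70.5 - 63.8, 70.5), 1))
```

The general factor's share of common variance comes out near the 90% reported above, and the four first-order factors together account for roughly the remaining 10%.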

Developmental Sequence 

To further investigate the four-factor model, we tested the individual items for threshold
reversals and for reasonable threshold spacing using the full sample (N = 4,071). The items were
examined by each factor because unidimensionality is an assumption of standard IRT models
(Embretson & Reise, 2000) and because multidimensionality may contribute to threshold
reversals (Andrich et al., 1997). The assumption of unidimensionality and the concern that
multidimensionality may be causing reversals were addressed by examining the thresholds by factor.


TABLE 4
Interfactor Correlations for the Four-Factor Model

Factor                          1     2     3     4
1. Social Engagement            -     .91   .90   .92
2. Cognitive Skills                   -     .83   .91
3. Coordinated Movement                     -     [illeg.]
4. Scientific Process Skills                      -

TABLE 5
Orthogonalized Higher Order Model Loadings and Variance Explained

                                 General status        First-order factor
Item                             Est        %v         Est        %v         h²
Social engagement
  E. Relate to adults            .81        .66        [illeg.]   .05        .71
  F. Relate to children          .77        .60        .22        .05        .64
  A. Choices and plans           [illeg.]   .64        [illeg.]   .05        .69
  G. Resolve conflict            .73        .54        [illeg.]   .04        .58
  C. Initiating play             [illeg.]   .67        [illeg.]   .05        .72
  S. Complex speech              [illeg.]   .68        .25        .05        .73
  B. Solve problems              .78        .61        .22        .05        .66
  K. Pretending                  [illeg.]   .66        .23        .05        .71
  H. Feelings                    [illeg.]   .64        [illeg.]   .05        .69
  D. Personal needs              .76        .57        .21        .05        .62
  Q. Understand speech           [illeg.]   .72        .24        .06        .78
  I. Models                      [illeg.]   .67        .23        .05        .72
Cognitive skills
  V. Letters                     .75        .56        .25        .06        [illeg.]
  X. Writing                     [illeg.]   .61        .26        [illeg.]   [illeg.]
  BB. Counting                   [illeg.]   .64        .26        [illeg.]   .71
  Y. Sorting objects             [illeg.]   .71        .26        [illeg.]   .79
  W. Reading                     .76        .58        .25        .06        .64
  Z. Identify patterns           [illeg.]   .66        .27        [illeg.]   .73
  T. Sounds in words             .81        .66        .27        [illeg.]   .73
  J. Making pictures             [illeg.]   .63        .26        [illeg.]   .70
Coordinated movement
  N. Steady beat                 .79        .63        .36        .13        .76
  O. Move to music               [illeg.]   .61        [illeg.]   .13        .74
  P. Singing                     .76        .58        .35        .12        .71
  L. Move various ways           .76        .58        .35        .12        .70
  M. Move with objects           .72        .52        [illeg.]   .11        [illeg.]
Scientific process
  EE. Materials and properties   [illeg.]   .69        .23        .05        .74
  FF. Living things              [illeg.]   .66        .22        .05        .70
  CC. Position and direction     [illeg.]   .69        .23        .05        .74
  DD. Sequence and causality     [illeg.]   .76        .24        .06        [illeg.]
  AA. Properties                 [illeg.]   .74        .23        .05        .79

                      General   Social       Cognitive   Coordinated   Scientific
                      status    engagement   skills      movement      process      h²
% Total variance      63.8      2.1          [illeg.]    [illeg.]      0.9          70.5
% Common variance     90.4      2.9          [illeg.]    [illeg.]      1.2          100

Note. Loadings were transformed with the Schmid-Leiman procedure. Item titles are from HighScope (2010).
Est = factor loading; %v = percent variance explained; h² = communality.


The four factors were each calibrated via the PCM and GPCM. A chi-square difference test of
model fit indicated that for each of the four factors the GPCM fit better. Furthermore, for each of
the factors, the GPCM yielded higher internal consistency and total test information (calculated
as the inverse of the squared standard error: 1/SE²). Thus, the GPCM was retained to evaluate the
item skill points. Table 6 provides the GPCM parameter estimates for each item by factor. The first
column, a, provides estimates of Item i's discrimination parameter, which indicates the degree to
which skill point selection varies by ability level (Embretson & Reise, 2000). For Social
Engagement, the item discriminations ranged from 0.86 to 1.29 (M a = 1.07). For Cognitive Skills,
discriminations ranged from 0.83 to 1.27 (M a = 1.10). Item discrimination parameters for
Coordinated Movement ranged from 0.67 to 1.49 (M a = 1.08). Finally, discrimination ranged
from 1.06 to 1.58 (M a = 1.33) for Scientific Process Skills.

TABLE 6
Parameter Estimates From the GPCM (N = 4,071)

                                                    GPCM parameter estimates                   Skill point selection percentages
Item                                                a      SE     δ1      δ2      δ3      δ4         1    2          3    4    5
Social Engagement
  E. Relating to adults*                            1.14   0.03   -0.64   -1.14   0.12    0.84       16   [illeg.]   30   26   21
  F. Relating to other children                     1.07   0.03   -1.06   -0.98   0.01    0.65       12   12         27   25   25
  A. Making choices and plans                       1.29   0.03   -1.47   -0.74   0.51    1.45       9    19         39   23   10
  G. Resolving interpersonal conflict               0.96   0.02   -1.38   -0.05   0.75    1.50       13   33         28   17   9
  C. Initiating play                                1.17   0.03   -1.54   -1.09   0.48    0.51       7    14         39   16   24
  S. Using complex patterns of speech               1.15   0.03   -1.32   -0.51   0.44    0.59       11   22         28   16   23
  B. Solving problems with materials                1.02   0.02   -1.58   -0.48   0.32    1.59       9    25         29   25   12
  K. Pretending                                     1.18   0.03   -0.90   -0.70   -0.10   1.55       15   14         23   35   13
  H. Understanding and expressing feelings*         0.86   0.02   -0.20   -0.63   0.69    0.53       27   13         24   12   23
  D. Taking care of personal needs                  0.89   0.02   -2.92   -1.15   0.23    0.50       2    16         34   22   26
  Q. Listening to and understanding speech          1.11   0.03   -0.58   -0.26   -0.07   0.77       23   16         15   23   22
  I. Making and building models                     1.02   0.03   -0.85   -0.54   0.58    1.52       18   18         28   23   13
Cognitive Skills
  V. Using letter names and sounds*                 1.10   0.03   -0.10   0.61    0.40    1.55       40   23         11   16   10
  X. Writing*                                       1.23   0.03   -0.16   -0.56   0.89    2.12       30   11         35   19   4
  BB. Counting                                      1.13   0.03   -1.09   -0.26   0.70    [illeg.]   16   25         29   16   15
  Y. Sorting objects                                1.27   0.03   -0.85   -0.26   0.62    1.59       20   22         29   19   10
  W. Reading                                        1.04   0.03   -1.65   -0.07   0.53    1.55       10   32         25   30   4
  Z. Identifying patterns*                          0.93   0.02   -0.24   -0.09   1.12    0.51       32   20         25   10   14
  T. Showing awareness of sounds in words           1.23   0.03   -0.86   0.72    0.73    1.78       23   43         14   14   6
  J. Drawing and painting pictures*                 0.83   0.02   -0.60   -1.27   0.74    1.06       16   10         39   20   15
Coordinated Movement
  N. Feeling and expressing steady beat             1.49   0.04   -1.55   -0.41   0.50    0.66       9    26         28   15   22
  O. Moving to music                                1.33   0.04   -1.50   -0.19   0.55    0.77       10   29         22   18   21
  P. Singing*                                       1.09   0.03   -1.26   -1.58   0.52    0.59       8    7          42   16   27
  L. Moving in various ways*                        0.81   0.02   -2.33   -0.83   0.11    -0.18      4    18         23   16   39
  M. Moving with objects                            0.67   0.02   -1.13   -1.08   0.72    1.52       12   16         37   22   13
Scientific Process Skills
  EE. Identifying materials and properties          1.35   0.04   -0.27   -0.17   0.87    1.42       32   16         28   15   9
  FF. Identifying natural and living things*        1.06   0.03   -0.48   -0.10   0.93    0.73       27   21         24   11   16
  CC. Identifying position and direction            1.34   0.04   -0.76   -0.02   0.82    2.28       23   26         28   20   3
  DD. Identifying sequence, change, and causality*  1.58   0.05   -0.50   0.17    1.07    0.59       30   26         23   7    14
  AA. Comparing properties                          1.34   0.04   -0.52   -0.01   0.41    1.96       27   20         21   27   5

Note. SE is the standard error for the slope parameter estimate; δj is the threshold parameter, with each
item having four thresholds (j), or one fewer than the number of skill points (j = m - 1). Item titles
are from HighScope (2010). GPCM = generalized partial credit model.
*Item thresholds are disordered.

The δij columns provide the item threshold parameter estimates, which indicate the level
of the latent trait that separates two adjacent skill points (Embretson & Reise, 2000). The δij
therefore represent the intersection point of the category characteristic curves (CCCs) for
adjacent skill points (Adams et al., 2012). The CCCs show the probabilities of a child being
rated at each of the skill points as a function of his or her ability with respect to the underlying
latent trait (Embretson & Reise, 2000). Figure 2 shows the CCCs generated by PARSCALE 4.1
(Muraki & Bock, 2002) for Item Y, which has ordered and well-spaced thresholds.

Of the 30 items in the four-factor model, 10 displayed threshold reversals (these items are
indicated in Table 6). For example, Figure 3 shows the CCC for Item V, "Using letter names
and sounds," on the Cognitive Skills factor. Examining this CCC, it is clear that Thresholds
2 and 3 are reversed. Threshold 2, between Skill Point 2 (naming letters) and Skill Point 3
(making a letter sound), is located at a higher point along the cognitive ability continuum than
Threshold 3, which lies between Skill Points 3 and 4 (which is about naming more letters than
Skill Point 2). A possible explanation for this reversal was found in the research on early reading
skills (e.g., McBride-Chang, 1999; Treiman, Tincoff, Rodriguez, Mouzaki, & Francis, 1998),
which suggests that making a letter sound (Skill Point 3) is potentially a more advanced skill
than naming letters (Skill Points 2 and 4). Thus, moving from Skill Point 2 to 3 (Threshold
2), naming letters to making a letter sound, may be more difficult than going from Skill Point
3 to 4 (Threshold 3), making a letter sound to naming letters. Because reversed thresholds
may result from infrequently selected categories (Adams et al., 2012), the category response
percentages are also presented in Table 6. Half of the 10 items with disordered thresholds
had one or more skill points that were selected for 10% or less of the sample, but every skill
point was used at least once.
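
To make the idea of a threshold reversal concrete, the sketch below computes GPCM category probabilities and flags adjacent thresholds that are out of order. It is an illustration under stated assumptions, not a reproduction of the PARSCALE calibration: it uses the plain logistic form of the model without a scaling constant, the function names are introduced here, and the parameter values for Items V and Y are as read from Table 6.

```python
import math


def gpcm_probs(theta, a, thresholds):
    """Probability of each skill point (1..m) at ability theta under the GPCM.

    Each category's log-numerator is the running sum of a * (theta - delta_j);
    the lowest category has an empty sum of zero.
    """
    z = [0.0]
    for delta in thresholds:
        z.append(z[-1] + a * (theta - delta))
    top = max(z)                                  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in z]
    total = sum(weights)
    return [w / total for w in weights]


def reversed_pairs(thresholds):
    """Return (j, j + 1) for each pair of adjacent thresholds that is out of order."""
    return [(j, j + 1) for j in range(1, len(thresholds))
            if thresholds[j - 1] > thresholds[j]]


item_v = {"a": 1.10, "thresholds": [-0.10, 0.61, 0.40, 1.55]}   # reversal reported
item_y = {"a": 1.27, "thresholds": [-0.85, -0.26, 0.62, 1.59]}  # ordered thresholds

print(reversed_pairs(item_v["thresholds"]))
print(reversed_pairs(item_y["thresholds"]))
```

Under this form of the model, two adjacent category curves cross exactly at their threshold. When Thresholds 2 and 3 are reversed, as for Item V, there is no ability level at which Skill Point 3 is the single most likely rating, which is why reversals call the intended developmental sequence into question.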



FIGURE 2 Category characteristic curve for Item Y with ordered and well-spaced thresholds. Graphs
were generated by PARSCALE 4.1 (Muraki & Bock, 2002).

FIGURE 3 Category characteristic curve for Item V with a threshold reversal. δij are the threshold
parameters. Each item (i) has four thresholds (j). Graphs were generated by PARSCALE 4.1
(Muraki & Bock, 2002). Threshold markers were added.


Finally, an exploratory inspection of the threshold spacing was performed using the results in
Table 6. Four of the COR-2 items (C, F, T, and M) had pairs of adjacent thresholds that appeared
to be close to one another on the latent trait continuum (i.e., a difference of less than .10). For
Item M, "Moving with objects," Threshold 1 (between Skill Points 1 and 2) corresponded to an
ability level of -1.13, whereas Threshold 2 (Skill Points 2 and 3) was located at an ability of
-1.08 (see Figure 4). Skill Point 2 was the most likely response for only a small portion of the
Coordinated Movement continuum. In contrast, the thresholds of Item Y appear to be well
spaced, suggesting that each skill point identified a distinct point on the cognitive ability
continuum (see Figure 2).
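
The spacing criterion described above (adjacent thresholds less than .10 apart) can be checked mechanically. The following sketch uses threshold values as read from Table 6; the function name `close_pairs` and the default gap argument are introduced here for illustration.

```python
def close_pairs(thresholds, min_gap=0.10):
    """Return (j, j + 1) for adjacent thresholds closer together than min_gap."""
    return [(j, j + 1) for j in range(1, len(thresholds))
            if abs(thresholds[j] - thresholds[j - 1]) < min_gap]


table6_thresholds = {
    "C. Initiating play": [-1.54, -1.09, 0.48, 0.51],
    "F. Relating to other children": [-1.06, -0.98, 0.01, 0.65],
    "T. Showing awareness of sounds in words": [-0.86, 0.72, 0.73, 1.78],
    "M. Moving with objects": [-1.13, -1.08, 0.72, 1.52],
    "Y. Sorting objects": [-0.85, -0.26, 0.62, 1.59],
}

for title, thresholds in table6_thresholds.items():
    print(title, close_pairs(thresholds))
```

Items C, F, T, and M each return one closely spaced pair, whereas Item Y returns none, which matches the contrast drawn in the text.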

FIGURE 4 Category characteristic curve for Item M with closely spaced thresholds. Graphs were
generated by PARSCALE 4.1 (Muraki & Bock, 2002).

As an exploratory and remedial measure, the 10 items with disordered thresholds were
removed from their corresponding factors, resulting in two of the four factors retaining only three
items. Gorsuch (2003) noted that to robustly define a factor, four or more indicators are needed.
Thus, an exploratory investigation of the dimensionality of the 20 items from the four-factor
model with ordered thresholds was undertaken. These items were submitted to EFA and CFA
to determine the optimal factor structure. Using the exploratory sample, MAP of the smoothed
polychoric correlation matrix suggested two potentially viable factors. A three-factor model
was also tested, but it did not retain four salient loadings on the third factor. In the two-factor
model, one of the items loaded on two of the factors and was removed from both. The two factors
were interpreted based on items with the largest loadings, which indicated that they represented
cognitive and noncognitive skills. Using the reserve sample, the CFA of the 19-item two-factor
model indicated that this model did not fit well in terms of the criteria set for the global fit indices
(CFI = .981, RMSEA = .077, WRMR = 1.909, χ²(151) = 1,956.04, p < .00001); in addition, the
factors were highly correlated (r = .93). Finally, a one-factor model was fit using the 20 items, but
its global fit indices were worse than those of the two-factor model.


DISCUSSION 

This study provided the first independent investigation of the second most widely used
multidimensional assessment in Head Start, the COR-2. The internal structure of the COR-2 was
examined by (a) testing the fit of the six developer-defined categories to the data, (b) empirically
deriving the optimal factor structure, and (c) testing the hypothesis that the five skill points of
each item represented appropriately sequenced and reasonably spaced response options. The
modification indices and very high interfactor correlations indicated problems with the model
based on the developer-defined categories. EFA, CFA, and higher order factor analysis were used
to empirically derive the optimal internal structure. A four-factor model fit the data better than
the six-factor model. However, high interfactor correlations were again found, calling into
question the extent to which the factors were distinct. An orthogonalized second-order factor analysis
suggested that the four first-order factors explained only a small proportion of the variance
compared to the second-order factor. Finally, an examination of the skill points of the items indicated
that about a third had threshold reversals and four had poorly spaced thresholds.

The only other study of the COR-2, which is unpublished, did not test the fit of the six categories
(HighScope, 2010). Thus, currently there is no empirical research available on the COR-2 to
support the use of scores based on the six categories. The present study also found very high
interfactor correlations for all of the structures assessed. Lower interfactor correlations have been found
for other validated early childhood assessments, such as the Learning Express (M r = .66) and
Learning-to-Learn Scales (M r = .61; McDermott et al., 2009, 2011). However, the Learning
Express and Learning-to-Learn Scales may be the exceptions, as highly correlated factors have
been found for other widely used early childhood assessments. For example, in an investigation
of the Wechsler Intelligence Scale for Children-Fourth Edition, Watkins et al. (2006) found initial
support for a four-factor solution with high interfactor correlations. However, an orthogonalized
higher order model revealed that a general factor explained the majority of common (76%) and
total (47%) variance (the first-order factors collectively explained 15% of common and 24% of
total variance; Watkins et al., 2006). Watkins et al. (2006) concluded that given the weak
explanatory power of the first-order factors, it would be a mistake to favor their interpretation over the
general factor. A similar conclusion could be tentatively drawn from the findings reported here.

The threshold reversal and spacing issues uncovered for the COR-2 are in line with the
conclusion of Fantuzzo et al. (2002) for the COR-1 that many of the items have skill points that do
not indicate a developmental progression. However, the IRT methods used in the present study
to investigate item functioning are not common in evaluations of early childhood assessments
(Gordon, Fujimoto, Kaestner, Korenman, & Abner, 2012). Still, this small research base
demonstrates that other measures have similar issues with item functioning. Andrich and Styles
(2004) found that many of the items on the Early Development Instrument had disordered
thresholds. Gordon et al. (2012) found that all of the items on the Early Childhood Environment
Rating Scale-Revised had threshold reversals and about two thirds also had poorly spaced
thresholds. Both studies recommended further development using information gleaned from
modern psychometric methodologies not available when many of these measures were first
developed (Gordon et al., 2012). This recommendation is also applicable for the future
development of the COR-2, as discussed next.


Implications for Future Research and Policy 

The purpose of this research was to examine the psychometric quality of the COR-2 for use as a
multidimensional assessment in Head Start. However, the study is limited by the nature of the
Head Start sample and the characteristics of the teachers. This study provided an exploratory
investigation of the use of the COR-2 in Head Start in one large, urban school district that
was primarily serving an African American population. The data analyzed were collected by
Head Start teachers who were required by the district to have at least a bachelor's degree as well
as a certification in early childhood education. Teachers' knowledge and skills influence their
ability to effectively use an assessment and help to determine their training needs, such that
teachers with fewer qualifications need more training (Mathematica Policy Research, 2007).
The results of this study may not be generalizable to teachers with fewer qualifications and/or
different training, to other early childhood programs, or to children from other ethnicities and
locales (e.g., rural areas). Additional studies are needed to determine whether the issues
uncovered here apply broadly.

This study provides an indication of where the COR-2 is in its scientific development at
this time and points to several paths for future research and development. Specifically, the
findings presented here highlight the need to further develop the COR-2 in terms of the constructs it
aims to measure and the items it uses to do so. Per Downing and Haladyna (2006), an iterative
cycle of systematic development and validation should be used for all areas of future work.
This iterative cycle involves using information collected during development to inform
validation and evidence collected during validation to inform further development (Downing &
Haladyna, 2006).

Like many early childhood assessments, the COR-2 aims to measure six constructs essential
for school readiness. Measures that only provide information on a subset of these constructs or
on general developmental status have less practical utility for practitioners and policymakers.
This study's findings do not support the use of scores based on the six categories of the
COR-2 and indicate that further development is needed. Future work should use an iterative
cycle of construct development and validation, such as the evidence-centered design approach
suggested by Mislevy and Riconscente (2006). The construct development phase of
evidence-centered design includes working with teachers, experts, and researchers to delineate the facets
of the construct (Mislevy & Riconscente, 2006). This process provides a rigorous approach to
construct development that yields validity evidence based on content and precise and distinctive
construct definitions to be used for item development (Mislevy & Riconscente, 2006). It also
helps to ensure that a measure is created that can be used to make valid and reliable inferences
about each of the constructs targeted.

For each of the COR-2 items, the skill points purport to capture the developmental sequence
of the construct facet represented by the item title (e.g., for Item M, "Moving with objects" is
the title). However, this study revealed problems with the sequencing and spacing of some of the
items' skill points. As the functioning of the COR-2's items was previously unexamined, future
research should replicate these results with a large representative sample of children and
teachers. Based on this information and the results of this study, problematic items should be
flagged for redevelopment using qualitative and quantitative procedures (LeBoeuf, Fantuzzo,
& Lopez, 2010). The qualitative investigations would involve having external subject matter
experts identify potential sources of malfunctioning (e.g., invalid sequence; Gordon et al.,
2012). In addition, teachers could be asked to "think aloud" as they observe, take notes, and
use their observations to respond to the items (Cook & Beckman, 2006). Findings from this
research would identify problems with the response process, such as difficulties in interpreting
the skill points (Gordon et al., 2012). The items would then be revised based on the information
gathered and knowledge from current developmental theory and research (LeBoeuf et al., 2010).
Quantitative investigations using IRT modeling can then empirically confirm that the revised
items are functioning as intended.

There are many challenges currently facing Head Start; however, arguably none is more
pressing than the need for high-quality assessments (Advisory Committee on Head Start Research and
Evaluation, 2012; NRC, 2008). Assessments leading to invalid and unreliable inferences result
in decisions with potentially negative consequences for children, teachers, and programs
(McDermott et al., 2011). The need for psychometrically sound assessment has received
increasing federal attention. The Improving Head Start for School Readiness Act of 2007 contained for
the first time the requirement that all programs use scientifically based measures. More recently,
the U.S. Department of Education's Race to the Top-Early Learning Challenge called for states
to implement comprehensive early childhood assessment systems that consist of scientifically
based measures applicable to the diverse populations served. As the demand for the use of
evidence-based assessments increases, so does the need to provide comprehensive information
on the quality of measures (NRC, 2008).

There is a significant need to examine what is mandated and the evidence-based capacities of 
widely used assessments. The NRC (2008) provides an overview of the standards for scientifically 
based measures as well as a table of widely used assessments. However, this report does 
not apply the quality standards to the table of assessments, calling for the field to provide the 
evidence to bridge this gap (LeBoeuf et al., 2010). The present research responded to the NRC’s 
call by investigating the psychometric quality of the second most widely used multidimensional 
assessment in Head Start, the COR-2. This study is part of a growing body of research (e.g., 
Fantuzzo, McDermott, Manz-Holliday, Hampton, & Burdick-Alvarez, 1996; LeBoeuf et al., 
2010) that investigates early childhood assessments that are widely used with preschool children 
from low-income households. Without scientific investigation of the quality of assessments, the 
efficacy of educational programs for young children at high risk for academic failure is in serious 
jeopardy. 

The results of this study also have larger implications for policy. Multidimensional assessments 
are essential for Head Start to meet its goal of promoting school readiness across important 
domains discussed in the Child Development and Early Learning Framework, including 
“physical well-being, social and emotional development, approaches toward learning, language 
and literacy skills, and cognitive and general knowledge skills” (Office of Head Start, 2010, 
p. 6). To align with Head Start’s guiding framework for child outcomes, assessments must be 
capable of providing scientific evidence on each domain. In addition to providing data on the 
state of the school readiness domains, Head Start programs are also held accountable for 
monitoring progress in these areas. For programs to do this, assessments with items that reflect 
a valid developmental sequence and therefore are able to capture growth are needed. Beyond 
serving an accountability function, such assessments also improve practice by monitoring 
children’s progress and guiding instructional decisions to create appropriate opportunities for 
further development. Thus, multidimensional assessments capable of measuring growth provide 
the actionable intelligence for policymakers and teachers to fulfill Head Start's school readiness 
goal. 

In addressing the pressing need and mandates for scientifically based measures, it is essential 
that the evidence on the psychometric integrity of early childhood assessments is shared with 
leaders in early childhood education. These individuals need access to independent evaluations 
of the assessments they are considering using in their programs. To ensure that psychometric 
evidence is available and disseminated beyond academic journals and circles, a consumer guide 
for practitioners and policymakers containing such information is needed. 

To move beyond simply mandating the use of scientifically based measures, a national 
research agenda is needed to ensure that the school readiness of the most vulnerable children 
is scientifically measured. This agenda should include assessment evaluation, development, 
and information dissemination. To pursue these important but difficult tasks, the government 
must make appropriations to financially support these efforts. There is a clear need for 
scientifically based measures and calls for their use have been made. These calls must be answered with 
systematic and rigorous science to provide evidence that supports the quality education all young 
children deserve. 